First principles view on chemical compound space: Gaining rigorous atomistic control 

of molecular properties 
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A well-defined notion of chemical compound space (CCS) is essential for gaining rigorous control of 
properties through variation of elemental composition and atomic configurations. Here, we review an 
atomistic first principles perspective on CCS. First, CCS is discussed in terms of variational nuclear 
charges in the context of conceptual density functional and molecular grand-canonical ensemble 
theory. Thereafter, we revisit the notion of compound pairs, related to each other via "alchemical" 
interpolations involving fractional nuclear chargens in the electronic Hamiltonian. We address Taylor 
expansions in CCS, property non-linearity, improved predictions using reference compound pairs, 
and the ounce-of-gold prize challenge to linearize CCS. Finally, we turn to machine learning of 
analytical structure property relationships in CCS. These relationships correspond to inferred, rather 
than derived through variational principle, solutions of the electronic Schrodinger equation. 



I. INTRODUCTION 

In analogy to the vastness and sparseness of outer 
space, we can loosely refer to the space of chemical sys- 
tems as chemical compound space (CCS), i.e. some con- 
tinuous observable space that is populated by all epxcri- 
mentally and theoretically possible chemicals with inte- 
ger nuclear charges and interatomic distances for which 
chemical interactions occur—. Stated more precisely, CCS 
refers to the combinatorial set of all compounds that can 
be isolated and constructed from possible combinations 
and configurations of Nj atoms and N e electrons in real 
space. In absence of external fields and given N e and Nj 
atom-types {Zi} and spatial configurations {R/}, not 
only covalent, ionic, and metallic bonding result, but also 
the much weaker hydrogen and van-der-Waals-bonding, 
responsible for the physics and chemistry of molecular 
crystals, liquids, and other supra-molecular aggregates, 
can be derived, as well as all other quantum and statisti- 
cal mechanical properties such as electronic states, elec- 
tronic and vibrational spectra, free energies, and even 
phase diagrams and rare events such as chemical reac- 
tions. While most research efforts in this first princi- 
ples context have been dedicated to the approximations 
and methods necessary for making property predictions 
for given compounds, the focus of this tutorial is a first 
principles view on the compounds per se. 

Notwithstanding chemical bonding or conformations 
and merely considering the number of possible stoi- 
chiometries it is obvious that the size of CCS is unfath- 
omably large for all but the smallest systems. Due to all 
the possible combinations of assembling many and vari- 
ous atoms its size scales exponentially with compound 
size as cx Z^' ax . Here Z max is the number of possi- 
ble atom types, i.e. the maximal permissible nuclear 
charge in Mendeleev's table (Z max > 100), and Nj de- 
pends on the employed definition of "isolated system" 
but can certainly reach Avogardro's number scale for 
living organisms, chunks of unordered matter, or plan- 
ets. While many of such speculative compounds are 



likely to be unstable, the state of affairs worsens when 
accounting for the additional degrees of freedom which 
arise from distinguishable geometries due to differences 
in atom bonding or conformations. This combinatorial 
explosion with system size is the main motivation for ad- 
vocating an ab initio, or first principles, view on CCS, 
i.e. a view that restricts us to use solely {Zi} and {R/} 
as input variables^!, and, while maybe not free of pa- 
rameters, will not change in its parameterization as {Zi} 
and {R/} are freely varied^. A major part of mod- 
ern electronic structure theory and interatomic poten- 
tial work is concerned with the development of improved 
methods and approximations for solving Schrodinger's 
equation (SE) within the Born-Oppcnhcimcr approxima- 
tion for Hamiltonians relevant to materials, biological, 
or chemical research, and deriving properties thereof^. 
Ab initio statistical mechanics efforts are dedicated to 
sampling the corresponding 37Vj — 6 degrees of freedom 
from first principles^. In the context of CCS, the elec- 
tronic Hamiltonian H for solving SE, = , of 
any compound with a given charge, Q = N p — N e , is 
uniquely determined by its (unperturbed) external po- 
tential, v(r) — J2r Zi/\r — R/|, i.e. by its set {Rj.Zj}. 
Here, N p is the total number of protons in the sys- 
tem, i.e. the sum over all nuclear charges. Due to the 
Hohenberg-Kohn theorem we also know that the electron 
density n(r), and all electronic properties derived thereof, 
are determined by {Z/,R/}, up to a trivial constant, 
{H(r),N e } o {Zj,Rj,N e } O {v(r),N e } O n(r)£. 
Consequently, we work with {Zi, R/, N e }. 

In this tutorial, CCS is first briefly illustrated in terms 
of a rough energy scale in section [Til In section IIIII we 
will review the notion of a molecular grand-canonical en- 
semble density functional theory that can deal with frac- 
tional electrons and nuclear charges. Section [TVl will deal 
with pairs of chemical compounds, and with efforts to ex- 
ploit the arbitrariness of interpolating functions. It also 
details the challenge associated to a prize award of one 
ounce of gold. Finally, we will study recent efforts to 
use intelligent data analysis methods (machine learning) 
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to systematically infer analytical structure property rela- 
tionships from previously calculated electronic structure 
data sets in section IVl 
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FIG. 1: Exponential scaling of the total number of all 
possible TV partitions, i.e. stoichiometries, as a function of 
N p (Regime (i) in section [II]). Inset upper left-hand side: 
Young-Ferrers diagrams illustrating the possible partitions 
(stoichiometries) for N p as a function of number of atoms, 
Ni. The color code corresponds to the total number of pro- 
tons in the compound, N p € {1 (black), 2 (red), 3 (green), 4 
(blue)} Inset lower right-hand side: Total potential energy of 
relaxed molecules as a function of N p using interpolated pseu- 
dopotentials in analogy to Fig. [2] (all systems neutral, BLYP 
DFT level of theory^*-, arbitrary energy origin due to use of 
pseudopotentials) . 



II. ENERGY HIERARCHY 

Regarding CCS it is useful to think of a variable sys- 
tem that is comparable to the Mendeleev's table of the 
elements. Compounds, however, have many more di- 
mensions than a single atom's nuclear charge, specifically 
AN i — 6 (ANj — 5 if linear). One way of thinking about 
CCS is in terms of an abstract Gedankenexperiment in- 
volving all theoretically existing compounds. Consider 
all the compounds possible for a set of protons, subject 
to a varying amount of kinetic energy (or temperature) , 
provided by a thermostat. Regimes emerge of various 
familiar degrees of freedom. In this "phase diagram" of 
CCS these various regimes correspond to 

(i) stoichiometrical isomers: A fictitious "very high" 
temperature regime. Let us assume such high tem- 
peratures that all bonds break, and that all spatial 
degrees of freedom can safely be neglected. Fur- 



thermore, we assume isomers to have the same 
number of elementary particles, N p and N e . How 
many of such stoichiometrical isomers could be ob- 
served populating up to Nj < N p sites with at least 
one proton? Mathematically speaking, this is a dis- 
crete number theory problem: This number is the 
integer partition of N p , i.e. the number of ways to 
write N p as sum of positive integers. For exam- 
ple, CH 4 , NH 3 , H 2 0, HF, Ne represent only 5 out 
of all the 42 possible stoichiometrical isomers for 
N p = N e = 10. The total number of possible parti- 
tions corresponds to the partition function, which 
increases exponentially with N p . The exponential 
increase, and an illustration of the emerging stoi- 
chiometrics according to Young-Ferrers diagrams, 
are shown in Fig. [TJ These degrees of freedom are 
rarely explored in nature except when it comes to 
radioactive decay, nuclear fusion, or nuclear syn- 
thesis in the early stages of our universe. Through 
interpolation, however, we can meaningfully render 
this space continuous, as illustrated using density 
functional theory (DFT) for the potential energies 
displayed in the inset of Fig. [TJ 

(ii) constitutional isomers: "high" temperature regime. 
At high temperatures, only strong chemical bond- 
ing (covalent, ionic, metallic) survives. Corre- 
sponding Lewis structures enumerate many (but 
not all) of the possible constitutional isomers dis- 
tinguishable as possible topologies, or molecular 
graphs, that can be constructed. The enumer- 
ation (and canonization) of all possible constitu- 
tional isomers has been the focus of long stand- 
ing graph-theoretical efforts^—. The exponential 
scaling of their number is also evident for the re- 
cently published exhaustive list of small organic 
molecules^. This is the regime in which isomerism 
occurs through conventional "chemistry" , i.e. reac- 
tions that lead from one constitutional isomer to 
another, usually under the influence of pressure, 
temperature, light, or in the presence of some cat- 
alytic agent. We can model this and the subsequent 
regimes using ab initio molecular dynamics meth- 
ods^. Universal or reactive force-fields attempt to 
accomplish similar samplin g 13 ' 14 . 

(iii) conformational isomers: "ambient" temperature. 
Folding and un-folding events, sampling of in- 
tramolecular degrees of freedom, for example 
around dihedral angles and similar processes take 
place at "ambient" temperatures. These isomers 
are typically sampled using force-fields that as- 
sume fixed molecular topologies and parameterized 
charges, dihedral and angular terms, in addition to 
the typical potentials used for the chemical bond, 
such as harmonic, Buckingham's or Morse poten- 
tials. 

(iv) weakly interacting systems: "low" temperature. 
Supra-molecular assemblies, soft aggregates con- 
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dense to molecular liquids or solids. Typically mod- 
eled using classical effective Lennard- Jones type po- 
tentials. 

In the remainder of this review we will discuss re- 
cent contributions that are consistent with all of the four 
regimes, accounting for all the spatial and elemental de- 
grees of freedom {R/} and {Z/}. As also discussed be- 
low, we will ignore the electronic number as an indepen- 



dent variable since N e 
possible scenarios. 



N p = X^j %l f° r most if not all 



III. MOLECULAR GRAND-CANONICAL 
ENSEMBLE 

A. Theory 

Much of conceptual density functional theory (DFT) 
concerns the energy response to infinitesimal variations in 
number of electrons N e , or external potential, t>(r ) 15 ' 16 . 
While very important for interpreting orbitals, deriving 
reactivity indices, and even for rcdox-processes, the di- 
versity (and combinatorial scaling) of CCS is rather due 
to variations in nuclear charge distribution, than due to 
variations in N e . Consequently, for the following we will 
mostly be concerned with changes in nuclear charges. In 
order to offer a rigorous framework for explicit changes 
in {Z/}, molecular grand-canonical ensemble DFT was 
introduced^, relying to a significant degree on preceding 
wor k 18 ' 19 . Only a brief summary is given here, for more 
details the reader is referred to the original contributions. 

Assuming a classical nuclear charge distribution, n p (r), 
we can introduce an auxiliary grand-canonical variational 
energy functional for the aforementioned fictitious "very 
high" temperature regime (i), 



Q[N e ,n p (r)} = E[N e ,n p (r)} - y. e (J rfrn(r) - A^ e 

-Hp ( / dr n p (r) - N p ) . (1) 



Where E, n, fj, e , fi p correspond to the usual total potential 
energy functional, the electron charge density, and global 
electronic and nuclear chemical potentials, respectively. 
For high temperatures, entropy will prevail and the sys- 
tem would dissociate into Hn p - For lower temperatures, 
the potential energy will dominate the free energy, and 
the energy of a single atom (E(Z) m Z 2A = N 2A ) will 
dominate over the energy of many individual atoms that 
sum up to the same number of protons, J2j Zf 4 < N 2A . 
Hence, the nuclear charge distribution would collapse 
onto a single site. For this discussion, we assume that 
the classical and fictitious self-repulsion of protons occu- 
pying the same nuclear site is switched off. 

For the lower temperature regimes (ii)-(iv), the follow- 



ing energy functional is more meaningful, 

Q[N e ,n p (r)] = E[N e , n p (r)} - ^ (J dr n(r) - JV e ) (2) 



dri h (T)[n p (T)-^2Zi6{\T-B. I \) 



where Y2j Zi5(\fLi — r|) corresponds to the spatially re- 
solved nuclear charge distribution. The nuclear chemical 
potential fj, p is no longer a global parameter but rather 
a locally defined Lagrange multiplier. Using an external 
potential that excludes the aforementioned intra-nuclear 
self-repulsion of protons (here through use of an error 
function), we find for the corresponding Euler equation, 



M r ) 



5E 
Sn p (r) 



= E 



Z/erf [ct|R/ - r[ 
Ir-RH 



- / dt 



, n(v') 



(3) 



— the electrostatic potential of the system. As such, 
starting with Q — a Legendre transformed energy func- 
tional of intensive properties \x e and n P (r) — one can de- 
rive the Gibbs-Duhcm equation for electrons and protons 
and electrons, 



dQ[/j, e , M r )l = -N e dn e - J drn p (r) 5n n (r), (4) 

and obtain relationships between electronic hardness 
d 2 E/dN e , molecular Fukui function dfj, p (r) / dN^, and 
nuclear hardness, 6(j,p(r)/dn p (r)H. While the nuclear 
chemical potential is defined everywhere, its value at an 
atomic position quantifies the system's first order energy 
response to a fractional change of the atom's nuclear 
charge. Consequently, we dub fi p (Rj) the "alchemical 
potential" of atom /— . 



B. Interpolating pseudopotentials 

Apart from radioactive processes alchemical changes 
obviously do not occur in reality. They offer, however, 
a rigorous mathematical way to render CCS continuous. 
Alchemical changes and potentials involving fractional 
nuclear charges are commonly used for two, often related, 
purposes: Either for the evaluation of free energy differ- 
ences between different compounds, e.g. using thermody- 
namic integration^, AF = J dX (dE/dX); or for obtain- 
ing a set of gradients with dimension of Nj indicating the 
response of the system to a variation in nuclear charge on 



every sib 



,19.22 



In practice, we can calculate such changes 



through interpolation of nuclear charges in any basis set 
that is converged for all values of an interpolating order 
parameter, A. For plane- wave pseudopotential implemen- 
tations, the same can be accomplished by interpolation 
of pseudopotentials that replace the explicit treatment 
of the core electrons^—. The use of a plane-wave ba- 
sis set is advantageous since it is independent of atomic 
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position and type, and will not introduce Pulay forces^. 
The manipulation of pseudopotentials for affecting elec- 
tronic structure properties is nothing out of the ordinary. 
It has successfully been deployed for an array of prop- 
erties including relativistic effects 3 ^, self-interaction cor- 
rections,—^ exact-exchange and QM/MM boundary ef- 
fect a 33 ' 34 , van-der-Waals interactions 3 -^* 3 ^, and widening 
the band gap 37 ' 38 . For fractional nuclear charges we can 
interpolate pseudopotentials, and evaluate properties as 
a function of order parameter, < A < 1. An inter- 
polation of pseudopotential parameters as a function of 
nuclear charge is shown in Fig. [5J Calculated properties 
as a function of such alchemical changes are illustrated in 
Figs. [Hand [3] for total potential energies, and protonation 
energies and polarizabilities, respectively. Note that the 
former property is not physical because of the arbitrary 
energy offset of pseudopotentials. This, however, is in- 
consequential, since most of chemistry deals with energy 
differences, and differences thereof. As shown in Fig. [3] for 
HC1 — > NH 3 , the use of pseudopotentials for alchemical 
changes can be particularly advantageous when it comes 
to transmuting elements from different rows of the pe- 
riodic table while keeping constant the total number of 
valence electrons. 
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FIG. 2: Interpolation of Goedecker-Hutter pseudopoten- 
tial parameters for the BLYP based DFT calculations^ 1 ^. 
Parameters, shown as a function of nuclear charge, become 
polynomial regressions of third degree in Z for intervals of Z 
connected such that the sum is continuous and differentiable 
everywhere. 



C. Free energy applications 

Fractional charges were used to calculate free energy 
of solvation of ions in water—. Sulpizi and Sprik rig- 
orously explored the need for fractional nuclear charge 
calculations to obtain pK a 's of various organic and in- 
organic acids and basest. In the case of free energy 
differences, fractional charges can also be avoided all to- 
gether within a simple alternative and elegant interpola- 
tion scheme put forth by Alfe, Gillan and Price: Atomic 
forces are evaluated at both end-points (A € 0, 1), and 
A dependent molecular dynamics trajectories are gener- 
ated for atoms being propagated according to a linear 
combination of these forces using instantaneous A values 
as weights 4 ^. This is to be compared to a trajectory 
that uses Hellmann-Feynman forces directly evaluated 
on the interpolated alchemical species, such as for the 
Temperature limit of relaxing the geometry of a reac- 
tion barrier—. The limitations of Alfc's procedure are 
that (a) one requires twice as many self-consistent field 
calculations, namely for both end-points instead of a sin- 
gle one when using an alchemical interpolation (assuming 
of course that for both approaches the free energy inte- 
grand (dE/dX), varies similarly with A); and (b) that the 
number of atoms must be kept constant during the in- 
terpolation, significantly restricting the possible number 
of stoichiometries that can be explored. Both of these 
disadvantages can be avoided within the compound pair 
scheme discussed in section [TV] 



D. Design applications 

Through use of a Taylor expansion, truncated after 
first order, we can also exploit the vector of alchemical 
potentials for gaining control over properties in the iVj- 
dimensional space of all the nuclei that are adjacent in 
the periodic table. Wcigend, Schrodt and Ahhichs were 
probably the first to present such an application for the 
prediction of stability in binary atom clusters^. Subse- 
quently it was applied to drug binding energies within 
a QM/MM studyi^, demonstrated for hydrogen-bonded 
complexes, inter-converting methane, ammonia, water 
and hydrogen-fluoride while bound to formic acic-22, 
and applied to the molecular Fukui function for tuning 
HOMO eigenvalues of boron/nitride derivatives of ben- 
zene^. In Ref42 this notion is exploited for the predic- 
tion of reaction barriers as well as oxygen adsorption en- 
ergies on Pd79 derived core-shell metal nano-clusters that 
are catalyst candidates for the oxygen reduction reaction. 
The molecular Fukui function, in particular when evalu- 
ated at the position of the atom, was also discussed more 
recently by Cardenas et al.— . Clearly, truncation of the 
Taylor expansion at second or higher order would be de- 
sirable to increase the accuracy of the alchemical predic- 
tions of the effect of atomic transmutations. Alas, higher 
order derivatives of the energy with respect to nuclear 
charges lead to computational overhead, they require the 
calculation of the perturbed electronic structure. Never- 
theless, based on coupled perturbed self-consistent field 
theory the improved accuracy of higher order predictions 
was demonstrated very recently^ 7 .. 
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E. Other applications 

Instead of varying the proton distribution n p (r) = 
Zi5(\r — Rf|) in a compound, we can interpolate 
the external potential i;(r) = Y^i^i/l? — R/| just as 
well. Albeit mixing up spatial and compositional de- 
grees of freedom, this is more in line with conventional 
thought in quantum chemistry and conceptual DFT— 
where the nuclear charge distribution is hardly men- 
tioned explicitly. The route via the external potential 
has been pursued within the Ansatz of linear combination 
of atomic potentials in the research groups of Yang and 
Beratan 4 ^, assigning atom-type specific weights to every 
atom site. Using simplified Hamiltonians, impressive re- 
sults were obtained for the control of molecular hyper- 
polarizabilities 4 ^— , based on long-standing molecular 
design efforts for electronic properties well ahead of their 
tim o 54 ' 55 . This approach has also been adapted and ex- 
plored for the purpose of crystal structure design using 
DFT—. The functional second order derivatives with 
respect to external potentials have been published in 
Ref^i. Analytical expressions for second order deriva- 
tives and linear response functions have very recently 
been proposed by Yang, Cohen, De Proft and Geer- 
lings^. The same authors also derived important con- 
straints for the electronic structure that must be met by 
the exact exchange-correlation functional. In analogy to 
using constraints obtained for variable N e , such as piece- 
wise linear behavior and derivative discontinuities, in or- 
der to design improved density functionals^2r— , Cohen's 
current efforts are dedicated to variations in the external 
potentials that include fractional nuclear charges. The 
electronic structure for systems with Z — > oo has also 
been explored by Constantin et al££. 



IV. COMPOUND PAIRS 

A. Background 

The above discussed Ansatz, variational in a fractional 
nuclear charge distribution, defines an appealing, fully 
spatially resolved, index, i.e. a way to probe the sensi- 
tivity of a compound not only towards changes in any of 
its composing atoms but also with respect to adding new 
protons. However, for two reasons this approach can also 
be limited. First, severe constraints and preconceived in- 
sights arc required to explore the A^-dimensional space 
of all {Zj}. Either because if Zi is continuous it requires 
a bias potential towards integer numbers, possibly using 
a fictitious temperature, i.e. in analogy to the Fermi func- 
tion for electrons. Or if Z\ is a combination of various 
atom-types, i.e. in line with the aforementioned linear 
combination of atomic potentials approach 4 ^, the weight 
of one nuclear charge has to dominate so that it can safely 
be increased to 1, while decreasing all others to zero. Fur- 
thermore, constraints due to overall charge conservation, 
and electronic structure, have to be considered. For ex- 



ample, consider an alchemical transmutation of H2N-OH 
into its iso-electronic stoichiometrical isomer hydrogen 
peroxide, HO-OH, through simultaneously and continu- 
ously decreasing and increasing by one the nuclear charge 
of the hydrogen and nitrogen atom, respectively. At some 
point of this conversion, the spin of the ground state sur- 
face will turn into a triplet surface, therefore requiring 
the consideration of both spin states along the interpo- 
lation path. Second, and more importantly, in order to 
carry out alchemical changes along columns in the peri- 
odic table, for example, a path following Z would have to 
fill up the shell to go through the entire period before one 
arrives at the desired elements. This implies significant 
variations in electronic configurations just to arrive at a 
target compound with a configuration likely to be very 
similar to the starting compound. For example, consider 
a system of 8 valence electrons, and Ne and Ar as start- 
ing and target compounds, respectively. Then, an iso- 
electronic path progressing with Z of the central atom, 
and saturating with hydrogens accordingly, would have 
to proceed through the following series of compounds, 
NaH 7 , MgH 6 , AIH5, SiH 4 , PH 3 , H 2 S, and HC1, some of 
which not even likely to be covalcntly bound. Hence, 
while Taylor expansions in Z are quite predictive for ad- 
jacent elements — as mentioned in the preceding section — 
it is not surprising that their predictive power decays dra- 
matically when it comes to predictions for changes up and 
down the columns in the periodic table. Obviously, mat- 
ters only become worse when d- or /-elements have to 
be included, or when trying to make predictions by 2 or 
more rows down or upward. 
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FIG. 3: Protonation energy, E = £ - E , and static 

polarizability a of neutral species, as a function of order pa- 
rameter A driving X = HG into X = NH3. The inset shows 
the derivative of the protonation energy, once evaluated ana- 
lytically according to Hellman-Feynman in Eq. JS), and once 
from a quadratic fit to protonation energy. Heavy atoms 
of two end-point molecules were superimposed, their effec- 
tive nuclear charge being (1 — A)7+A 5, and hydrogens were 
placed in xy- plane. The protonating proton was placed in 2, 
1 A above the heavy atom. All energies calculated with PBE 
functional and interpolated analytical pseudopotentials^. 
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B. Theory 

Albeit intuitive, the use of nuclear charges as interpo- 
lating variable is fortunately not mandatory. Instead, we 
can also use a generalized, and entirely arbitrary, interpo- 
lation procedure between any two pairs of compounds — 
as long as it is reversible and integrable, any path can be 
used to monitor any property that is a state function^. 
In all of the following, we will only consider interpola- 
tions between iso-electronic compounds, i.e. compounds 
with the same N e in their Hamiltonian. As mentioned 
before, this is only a minor restriction because the diver- 
sity of CCS is by far not as much due to differences in N e 
as it is due to differences in nuclear charge distribution. 
We can directly interpolate the nuclear charge distribu- 
tions/external potentials/Hamiltonians of any two such 
iso-electronic compounds, A and B, e.g. 

H(X,v) = H a (v) + X(H b (t)-H a (t)), (5) 

interpolating linearly and globally in order parameter A. 
H A and Hb denote the initial and final electronic Hamil- 
tonian of the two compounds, defining the corresponding 
boundary conditions H(X = 0) and H(X = 1), respec- 
tively For any iso-electronic Hamiltonian linear in A, 
the potential energy is not necessarily linear. In fact, 
the electronic potential energy of a linearly interpolated 
Hamiltonian will always be concave, i.e. equals or larger 
than a straight line connecting the energies of compound 
A and B, E Hn = E A - X(E B - E A ). This inequality 
follows from the variational principle and can easily be 
shown: Eq. [5] implies, 

E[H(X)} = (H(X)) X = E A [n x ] + X(E B [n x ]-E A [n x ]), 

(6) 

where (...)a now correspond to the usual quantum me- 
chanical Bra-Ket notation, denoting the expectation 
value with the wavefunction, or the density functional 
(in an orbital- free exact DFT world), evaluated for the 
Hamiltonian at A, i.e. E[H(X)} = (V X \H {X)\V x ) = 
(H(X)) X = E x [n x ]. E A [n x ] and E B [n x ] denote the en- 
ergies of compound A or B evaluated using the wave- 
functions (or density in the case of orbital free DFT) 
obtained at A. Note that n x= Q = ua, and n x= i = i%b- 
Subtracting E Un (X) and regrouping yields, 



E[H{X)] -E lm (X) 



= {E A [n x ] + X{E B [n x ]~ E A [n x ]) 
- {E A [n A } + X{E B [n B ] - E A [n A ])) 

= (1 - X)(E A [n x ] - E A [n A ]) 
+X{E B [n x ] -E B [n B ]), 

> 0. (7) 



where the prefactors of the energy differences A, (I — 
A) > by definition, and where -E^I^a] < ^a[«a] 
and Eb[tib] < Eb[u x } because of the variational prin- 
ciple. Consequently, analogous inequalities will hold for 
any property for which there is a variational principle, 



e.g. also for the polarizability due to Pearson's maxi- 
mum hardness principle^ 3 -. This inequality is on display 
for the static polarizability, fractionally transmutating a 
hydrogen chloride molecule into ammonia (Fig. [3]). Sim- 
ilarly, potential energy inequalities in between different 
molecules were proposed by Mezey in the eighties^. 

Analytical first order derivatives of the energy as a 
function of some iso-electronic change in the Hamil- 
tonian can easily be calculated using the Hcllmann- 
Feynman (HF) theorem 6 -^, as proposed and demonstrated 
for HOMO eigenvalues in Ref^, dE/dX = (8H/dX) x . 
For a linearly interpolating Hamiltonian, such as in 
Eq. ([5]), this leads to, 



dE[H(X)} 
OX 



(Hb-H a ) ; 



dr n x (r) x (v B (r) ~ V A (r)) 



= (H B ) X -(H A ) X = E B [n x ] - E A [n x ]. 



(8) 



The protonation energy, and its derivative, also feature in 
Fig-Ofor the same transmutational change, HC1 — >• NH3. 
As mentioned before, the use of pseudopotcntials/valcncc 
electron densities fortunately renders straightforward the 
application of the HF theorem according to Eq. ([8]) even 
for changes that involve elements from differing rows in 
the periodic table. 

Thermodynamic integration of dE/dX over A yields 
any properties related to free energy differences. In the 
case of compound design the approach is slightly differ- 
ent, we would like to expand the energy of a new com- 
pound B in terms of a reference compound A and its 
derivatives, 



Eb ~ E A [n A ] - 



oX 2 0X 2 



-HOT, 
(9) 



HOT standing for higher order terms, and AA = I. 
Unfortunately, when making predictions with a lin- 
early interpolated Hamiltonian, the first order deriva- 
tive term according to Eq. ([5]) is not necessarily pre- 
dictive^. Unfortunately, the inclusion of higher order 
derivatives in Eq. (|9]) might not only improve the predic- 
tion, as found for statistical mechanical averages^, but 
it also requires the evaluation of the perturbed wavefunc- 
tion, e.g. through the use of linear response theor y 33 ' 68 , 
thereby defying the original purpose of predicting a new 
compound's energy without having to calculate its wave 
function. Nevertheless, for external potentials this has 
been carried out within conceptual DFT—, and very re- 
cently even analytically^ 8 -. 

In order to improve the predictive power of the first 
order term in Eq. (j9)), an empirical correction has been 
introduced that "linearizes" the energy through a global 
yet non-linear Hamiltonian, H(X) = H A + f A b{X){H b — 
H A )^. If we assume /(A) to be a second order polyno- 
mial in A, two coefficients are determined by the bound- 
ary conditions that H(X = 0) = 0, and H(X = 1) = I, 
leaving one additional degree of freedom. We can obtain 
the remaining degree of freedom as a parameter from an 
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arbitrary second iso-clcctronic compound pair, CD, such 
that the energy becomes linear in A. The resulting ex- 
pansion up to first order in Eq. ([9]) then becomes, 

E A « BxM+C^^MaA. (10) 

Here, C^fy is the ratio between the energy difference and 
the HF derivative of the additional reference compound 
pair, C and D (its Hamiltonian being linearly interpo- 
lated), and d\EA[nA\ is determined according to Eq. © 
as Eb[ua\ — Ea[ua\- This bears resemblance to a long 
tradition in physical chemistry, namely the use of refer- 
ence compounds for electrode potentials or enthalpies. 

The idea to use alternative, non-linear, interpolations 
is not new within the molecular mechanics research. In 
the context of electronic structure theory non-linear al- 
chemical paths were also explored for chemical binding^ 9 -, 
and nuclear quantum effects 7 -^. Various open questions 
deserve further investigation, such as transferability and 
choice of reference coefficients, iso-electronic changes us- 
ing valence electrons only versus all electron description, 
non-iso-electronic changes, necessary accuracy when pro- 
viding the input of target compound B. i.e. also its ge- 
ometry, ionic forces of B, etc. These answers are likely 
to depend on systems and properties. 



C. Control of ligand binding 

In this section, we exemplify the use of the reference 
coefficients [Eq. (jlOp ] for increasing the predictive power 
of the HF derivatives of linearly interpolated Hamilto- 
nians. We refer to state-of-the-art van dcr Waals cor- 
rected DF T 35 i 71 to accurately estimate interaction en- 
ergies with binding targets across CCS. We will con- 
sider a small yet illustrative set of mutants of the el- 
lipticine molecule. Ellipticine is a naturally occurring 
anti-cancer drug with various binding targets. As also 
illustrated in Fig. @, its dominant mode of binding to 
DNA is intercalation. Structural data as well as stud- 
ies on drug analogues are readily availabl e 72 ' 73 . We will 
probe the versatility of the linearizing scheme for control- 
ling ellipticine-derivatives/DNA binding 7 ^. Clearly, for 
the eventual control of ligand binding the property of in- 
terest is not the potential energy of interaction but rather 
the free energies of binding: Solvation or entropic con- 
tributions can be crucial, as is well known in general^, 
and in the particular case of ellipticine^. For example, 
Tidor— , and Oostenbrink and van Gunstere n 78 ' 79 have 
carried out similar work in the sense of interpolating lig- 
and candidates, by calculating free energies of binding, 
and using molecular force-fields. However, here we will 
focus on the potential energy of interaction. Subsequent 
work in the future can deal with the inclusion of thermal 
and solvent effects for instance using ab initio molecular 
dynamics techniques^ in conjunction with QM/MM^i 
calculations. Moreover, even at the mere potential energy 




FIG. 4: TOP: Cluster model of drug intercalated in be- 
tween two Watson-Crick base-pairs connected by sugar puck- 
ers and phosphate groups. BOTTOM: Neutral "wild-type" 
ellipticine, Ri denote sites of groups permitted to mutate (see 
TAB. H]| . 



electronic structure level of theory, the accurate quantifi- 
cation and control of intercalated ellipticine derivatives is 
challenging: vdW forces dominate the binding. Recent 
studies have already explored the binding of ellipticine 
and how its vdW forces can be accounted for at the em- 
ployed electronic structure level 8 ^— . 

Let us consider the intercalation energy for the com- 
plex depicted in Fig. (gj) for mutations at the five sites 
indicated in the bottom panel. In analogy to protein or 
DNA sequences, an (arbitrary) relevant subspace of CCS 
is defined in TAB. ((I]) as a matrix that corresponds to an 
alphabet of iso-electronic (in valence electron number) 
functional groups at each of the selected sites. Note that 
variation in molecular combinations of letters of this al- 
phabet are capable to not only revert dipole-momcnts, 
they can also act either as hydrogen bond acceptors 
(lone pair in OH/C1) or donors (NH2, proton in OH). 
Clearly, the alphabet can easily be extended to accom- 
modate further effects, for example with electron donat- 
ing/withdrawing or hyperconjugating groups etc. Con- 
formational degrees of freedom can be encoded explicitly, 
as it is done for the hydroxyl groups in TAB. ((XJ) . 

Within this restricted CCS, any given molecule is rep- 
resented by the sequence of functional groups distributed 
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TABLE I: Exemplary alphabet for mutants of ellipticine as 
oriented in Fig. [31 denning a CCS with 4x6 = 5184 molecules. 
Highlighted in red are all functional groups whose mutations 
have been considered. Predictions are displayed in Fig. [S] 
The "wild-type" ellipticine drug is encoded as (21121) with 
three functional groups coming from the first column, and the 
functional groups at site Ri and R4 coming from the second 
column. 



site vs. group 


1 


2 


3 


4 


5 


6 


Ri 


CH 


N 


SiH 


P 






R 2 


CH 3 


NH 2 




OH rigM 


F 


CI 


R 3 


CH 3 


NH 2 


OH lctt 


Qjjr.ght 


F 


CI 


R 4 


CH 2 


NH 





SiH 2 


PH 


s 


R 5 


CH 3 


NH 2 


Qjjlett 


Qjjr.ght 


F 


CI 



over the five sites. For example, the "wild-type" ellip- 
ticine in Fig. (j4]) would correspond to (21121), i.e. 2 for 
N at Ri, 1 for CH 3 at R 2 , R 3 , and R 5 , and 2 for NH 2 
at the R4. Let us exemplify a DFT+vdW based predic- 
tion of the binding energy of another mutant: (21121) 
is predicted to bind to the DNA cluster in Fig. 0] with 
£21121 = —38.5 kcal/mol^ 2 -. For predicting the single 
point mutation (21121)^(21125) (changing CH 3 into F 
at R5), one would have to predict a target value of £21125 
= —36.9 kcal/mol. The derivative based prediction ac- 
cording to first order term in Eq. © is calculated to be, 
£21125 ~ £'21121 + d\E2\i2\ = —38.5 + 1.4 = —37.1 
kcal/mol. Inclusion of reference coefficient [Eq. (fT0|)]. 
and using compound pair (11121) / (11125) as a reference, 
yields £21125 ~ £21121 + C ref x ^£21121 = -38.5 + 
1.3x1.4 = -36.7 kcal/mol. 

In order to gain a more representative idea of the pre- 
dictive power of this method, Fig. [5] features the outcome 
for a small subspace of the CCS highlighted in red in 
TAB.IU Eight compounds have been considered involving 
permutations at Ri, R4, and R5, each with two possible 
functional groups. Predictions based on all the possi- 
ble derivatives among these compounds, with and with- 
out reference coefficients (as obtained from compound 
pairs not involved in the transmutation), are compared 
to calculated binding energies. Despite the several out- 
liers that deviate substantially, the use of reference com- 
pounds dramatically improves the overall prediction. 

For comparison, we also include predictions based on 
the additive assumptions that the influence of the rest of 
the molecule cancels when considering an interpolation 
for the same pair of functional groups. Specifically, we 
estimate the binding energy of B simply by adding the 
difference in binding energy of a reference compound pair, 
CD, to the binding energy of A, 

E B w E A + AE = E A + {E D -E C ) (11) 

As shown in Fig.O also this prediction yields remarkable 
good correlation — with less pronounced outliers. Analy- 
sis of the distribution of errors, however, suggests that 
in spite of the outliers the normal distribution of predic- 
tions around the ideal correlation is superior for prcdic- 




-38 -37 -36 



E [kcal/mol] 




E c -E [kcal/mol] 

FIG. 5: TOP: Correlation for eight compounds from alpha- 
bet in TAB. [T] Predictions made using first order derivatives 
only (dE/dX, Eq. ©), energy difference of reference com- 
pound pairs (AE, Eq. |TTJ)), or C ref dE/d\, Eq. {TO}. BOT- 
TOM: Normalized histogram and corresponding normal dis- 
tribution of error over 72 predictions, A c - p E mt = calculated 
E int - predicted K nt . 



tions made with the product of derivative and reference 
coefficient (Bottom of Fig. [5]). 



D. Win a prize 

The numerical illustrations in the previous section, as 
well as in Refi££, suggest that efforts to linearize the prop- 
erty through use of alternative, non-linear interpolations 
of the Hamiltonians are worthwhile. Strictly speaking, 
however, due to the use of reference compound pairs, the 
aforementioned interpolation constitutes no longer a first 
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principles but rather an empirical and heuristic Ansatz. 
What is needed instead, is an ab initio interpolating pro- 
cedure that linearizes the energy (or other properties) in 
order parameter, such that the first order Taylor expan- 
sion based on the HF derivative is sufficiently accurate 
to predict properties of other compounds^. 

Inspired by Erdos' habit to offer cash awards for solu- 
tions to outstanding mathematical problems, the author 
has thus decided to offer the equivalent of an ounce of 
gold to the first person who presents an ab initio solution 
to this problem. Specifically, the challenge reads: Find — 
or show non-existence of — an ab initio, i.e. valid for any 
external potentials, interpolating transform /as (A) for 
which two different but iso-electronic molecular Hamilto- 
nians with energies Ea and Eb interconvert such that the 
electronic ground state potential energy E = (if (/(A))), 
is linear in order parameter A, and that consequently the 
HF derivative is given by, 



dE(X) 



dX 



9H(f(X)) 
dX 



= Eb — Ea- 



(12) 



Here, < A < 1, and E(X = 0) = (H(f(X = 0))) = 
(H A ) = E A , and E(X = 1) = (H(f(X = 1))) = (H B ) = 
Eb- Further details can be found in footnote^. 

We can exemplify the challenge by solving it for the 
non-relativistic hydrogen-like single atom with only one 
electron. In this case, -E'(A) = aZ(X) 2 , where a is a con- 
stant, and the nuclear charge Z is a function of interpo- 
lating parameter A^^. For an interpolation linear in the 
Hamiltonian, Z(X) = Za + X(Zb — Za), and the energy 
is therefore clearly quadratic in A. The desired behavior 
of a linearized energy would be, 



E (X) = E(Za) + X(E(Z b )-E(Za)) 



a(Z\ 



X{Z% - Z A ))- 



Equating this to aZ(X) 2 and solving for Z(X) yields the 
corresponding interpolating function: 



Z(X) = y/z A + \(Z%-Z? A ). 



(14) 



As suggested above in the challenge, we can test this in- 
terpolation to confirm if indeed we find the desired slope 
for the linearized energy, Eb — Ea- Application of the 
chain rule, and insertion and differentiation of Eq. (|14p 
confirms the expected result, 



dE_ 
~dX 



dE dZ(X) 



dZ dX 
a(Z 2 E 



2aZ(X) 



A B ~ A 



Z 2 A ) 



E, 



2^Z 
E A - 



X{Z 



drive all atoms in compound A to atoms in compound B 
while linearizing the potential energy. 

Note that a naive extension of Eq. (fLfj) to assemblies 
of atoms, 



dE 
~8X 



^ dE dZ(X) ^ .dZ(R! 



A) 



ieA.B 



l£A,B 



dX 



(16) 



does not constitute a practical approximate solution to 
the challenge. /j, p (Rj) denotes the "alchemical" potential 
mentioned above which corresponds to the electrostatic 
potential at R/ without the repulsion due to Zj. fj, p will 
not necessarily cancel the square root term in the denom- 
inator of the derivative in Eq. (JUJ), which consequently 
diverges if A and Za(R-i) equal 0. 



V. STATISTICAL METHODS 
A. Inductive reasoning from first principles 

Within statistical mechanics the numerical prediction 
of macroscopic observables from atomistic simulation re- 
quires repeatedly calculating microscopic states, using 
electronic structure theory, atomistic or coarse-grained 
force fields, and averaging in an appropriate ensemble. 
Philosophically speaking, the exercise of performing such 
computational "experiments" is an application of deduc- 
tive reasoning to increase knowledge. But also when ex- 
ploring CCS in terms of ensembles of potential energy 
hypersurfaces by repeatedly solving SE for TV different 
compounds deductive reasoning is at work. Since the 
size of CCS is prohibitively large, its exhaustive explo- 
ration through screening with SE is impossible. While 
some interpolating A schemes use statistical mechanics 
(13) for a preselected set of compound a 77 ' 78 , a rigorous way 



As such Eq. ([14]) linearizes the energy in A. The challenge 
of the prize consists of finding an analogous expression 
for molecules, i.e. a spatially resolved and A dependent 
distribution of nuclear interpolations, {Z(R/,A)}, that 



to systematically and generally gain quantitative insights 
is desirable. This task can be accomplished through the 
application of inductive reasoning. 

Historically, the role of inductive reasoning for chem- 
istry is considerable, Mendeleev's table, the Hammett 
equation, or Pettifor's structure maps 8 ^— are all based 
on inferred phenomenological relationships. Further ex- 
amples include widely spread rules and notions of chem- 
istry, such as the chemical bond, atomic charges, or aro- 
maticity. While popular and useful to the experimental 
chemist conventional quantum chemistry, based on de- 
ductive reasoning, is still struggling to account for these 
notions^. Recent advances in statistical data analysis 
, methods^— and applications in other areas of science 
a and engineering, such as searching the internet, auto- 
(15)mated locomotion (self-driving cars), algorithmic trad- 
ing, or brain-computer interfaces, strongly suggest that 
they will also play an increasingly important role in 
chemistry. Examples of first efforts to quantitatively in- 
fer laws for atomistic simulations include "Learning On 
The Fly"—, or "force- matching" 95 i 96 . More sophisticated 
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statistical learning methods have been applied to the 
training of exchange correlation functional in density- 
functional theor y 97 ' 98 , or to parameterizing interatomic 
force fields^-i^. Support vector machines have been 
shown to quantify basis-set incompletenesses 5 .. Gaus- 
sian kernel based machine learning (ML) for very ac- 
curate reactive force-fields was introduced by Bartok et 
alJ£&. Contributions by Curtarolo, Hautier, and Coder 
combine data-mining with mean-field electronic structure 
theoryiffiiSii. Even the learning of reorganization ener- 
gies that enter Marcus charge transfer rates are promis- 
in g 110 ' 111 . Very recently, kernel based ML models have 
also delivered promising results for learning electron den- 
sity functionalsJ^, or transition state theory dividing 
surfaces that determine reaction rates^ 3 -. Bayesian error 
estimates and cross-validation methods have also been 
applied to the development of exchange-correlation mod- 
els with controlled transferability^. 



Within the bioinformatics and chcminformatics com- 
munities the development of quantitative structure prop- 
erty relationships (QSPRs) has a long tradition. QSPRs, 
relying on similar statistical frameworks (ML, cross- 
validated training, principal component analysis, etc.), 
deliberately attempt to circumvent solving the under- 
lying laws of physics by directly correlating system pa- 
rameters (descriptors) with macroscopic properties of in- 
terest. Conventionally, QSPRs are based on descriptors 
that explicitly forsake atomic resolution in the first prin- 
ciples sense. A large variety of such QSPR descriptors 
for various properties has been proposecUi^ii 8 -. Two 
such descriptors, the molecular signature by Faulon and 
coworkers^ and a combination of HOMO eigenvalues 
of charged and neutral species, have recently yielded 
promising results for the QSPR modeling of a first prin- 
ciples property, the reorganization energy, in the CCS of 
polycyclic aromatic hydrocarbons (PAHs)iii. PAHs are 
used in discotic liquid crystals which self-assemble into 
columnar liquid crystal structures, implying their useful- 
ness for organic photo- voltaic applications^ .. 



In this section we will discuss the application of ab ini- 
tio statistical learning approaches to previously obtained 
first principles data for N compounds. Merely based on 
the data, QSPRs can be inferred, i.e. "learned", and 
subsequently be used to avoid the cumbersome task of 
having to explicitly model all the underlying physical de- 
grees of freedom of electrons and nuclei. As such, ML es- 
timates solutions of SE for a new, i.e. "unseen" , molecule 
B simply by evaluating an analytical expression E est (B) 
that (explicitly or implicitly) encodes the data of N other 
molecules. Obviously, any such inferred relationships are 
inherently limited in accuracy by the quality of the data 
used for training. 



B. Machine learning in CCS: The quantum 
machine 

Recently, a kernel ridge regression approach to learn 
DFT atomization energies across CCS has been intro- 
duced^. Unlike ordinary QSPR approaches, this ML 
model is free of any heuristics. It exactly encodes the 
supervised learning problem posed by SE, i.e. instead 
of finding the wavefunction ^ which maps the system's 

Hamiltonian to its energy, H({Zi, R/}) i — > E, it di- 
rectly maps system to energy based on N examples given. 
In the limit of converged N, i.e. sufficiently dense sys- 
tem coverage, the ML model is therefore a formally ex- 
act inductive equivalent to the deductive solution of SE 
through use of approximate wave-functions (such as sep- 
arability of nuclear and electronic wavefunction or sin- 
gle slater determinants), Hamiltonians (such as certain 
exchange-correlation potentials) , and self-consistent field 
procedure to minimize the energy. In ReLi2I numerical 
evidence is given for this idea. Specifically, for a diverse 
set of organic molecules, one can show that a ML model 
can be used instead, {Z/,R/} t—h- E. After training, 
solutions to SE can be inferred for out-of-sample, i.e. 
"unseen", compounds that differ either in geometry or 
in composition or in both. The evaluation of an esti- 
mate is ordinarily negligible in terms of computational 
cost, i.e. milli seconds instead of hours on a conventional 
CPU, while yielding an accuracy competitive with the 
deductive approaches of modern electronic structure the- 
ory. As within any inductive approach, the accuracy is 
limited by the domain of applicability as defined by the 
data used for training, i.e. robust results can only be ex- 
pected in interpolating regimes with sufficient coverage. 
Within the Gaussian kernel model, the energy of a query 
molecule is given as a sum over N molecules in the 
training set, 

N , 

E est (M A ) = ^a, e . (17) 

i=l 

Each training molecule i contributes to the energy ac- 
cording to its specific weight o^, scaled by a Gaussian in 
its distance to M^, d(M.A, Mj). For given length-scale 
a and regularization parameter A, {on} are obtained by 
solving the regression problem, 

nun Yl( EeSt ( M i)- E i ef ) 2 + *J2al (18) 

i i 

a and A are hyperparameters. This regularized model 
limits the norm of regression coefficients, {c^}, thereby 
improving the transferability of the model to new com- 
pounds. All regression coefficients and hyper-parameters 
are determined by cross-validation on data stratified 
training set a 92 ' 93 . 

So far this model has been trained and validated only 
in its most rudimentary form for atomization energies 
of a small set of interesting compounds. Specifically, 
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molecular atomization energies at the hybrid DFT level 
of theory£ii22r-i2£ have been used for training on up to 
N ss 7000 molecules from the GDB data base^ (see 
Fig. [U for an illustration), for which mean absolute errors 
of less than 10 kcal/mol have been obtained. The choice 
of hybrid DFT is motivated by relatively small errors 
(< 5 kcal/mol) for thermo-chemistry data that includes 
molecular atomization energies^. While 10 kcal/mol is 
still far from "chemical accuracy" (« 1 kcal/mol), more 
recent progress has not only led to atomization errors 
with less than 3 kcal/mol accuracy^!, but also includes 
other electronic properties, such as frontier eigenvalues, 
polarizability, and excitation energiesi^. 

An appealing advantage of analytical models, in- 
dependent if obtained from physical insight or sta- 
tistical regression, is their amenability to physical 
analysis. For example, unlike electronic structure 
methods, otherwise ill-defined concepts such as dis- 
tance/neighborhood/similarity in CCS can now be quan- 
tified within the "world" of the ML model. Specifically, 
Eq. (|17l) gives the energy of a query molecule as 
an expansion in compound space spanned by reference 
molecules {Mi}: The regression weights {«i} are scaled 
by the similarity between query and reference compound 
as measured by a Gaussian of the distance. Hence, 
at assigns a positive or negative weight to molecule i. 
Within the compound space used as reference, molecules 
therefore can be ranked according to their |a|. How- 
ever, since {cti} are regression coefficients in a non- linear 
model, i.e. after a non-linear transformation of the train- 
ing data, the resulting energy contributions are specific 
to the employed training set without general implications 
for other properties or regions of compound space. The 
locality of the model is measured by a, enabling the def- 
inition of a critical distance of locality, d c , i.e. only if 
d(M.A, Mi) < d c will Mi contribute to the energy of M^ 
more than some threshold energy E c . Rearranging sum- 
mands in Eq. ((T7|) leads to d c (M A , M,) = <jy / 2\n{a l /E c ]. 
For atomization energies, and the chemical space consid- 
ered in Ref»i2i, i.e. with a critical distance < 400 Bohr 
(see TOP of Fig. [6]) the ML results suggest that the model 
becomes local when a < 60 Bohr, for the average a, and 
for E c = 1 kcal/mol. Such a values are achieved when the 
number of molecules in training set N exceeds ~5000. In 
other words, for N < 5000, the model is global, i.e. all ref- 
erence compounds contribute with more than 1 kcal/mol 
to any prediction made. See BOTTOM of Fig. H for the 
N dependence of a and A. 



C. Coulomb matrix descriptor 

To represent compounds, a wide variety of "descrip- 
tors" is in use by statistical methods for chem- and 
bio-informatics applicationsii^ii^. The descriptor intro- 
duced by Rupp ct alJ^i is based solely on coordinates 
and nuclear charges, and dubbed "Coulomb-matrix" , M, 
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FIG. 6: TOP: Distance distribution in GDB-13 for the 7000 
smallest molecules, AiJ rcf = \E{ — Ej\ versus d(Mj, Mj). 
BOTTOM: N dependence of a and A. 

a symmetric square matrix of Nj x Nj dimensions, 



2.4 



M, 



i j 



ZiZj 
k |R 7 - 



V / = J, 



(19) 



The diagonal elements, E{Z{) rj 0.5Zj' , correspond 
to a polynomial fit to free atom energies^. The off- 
diagonal elements correspond to the Coulomb repulsion 
between atoms / and J. For a data set containing 
molecules with differing number of atoms, all the {M} of 
all the smaller systems are extended by zeros until they 
reach the dimensionality of the largest molecule in the 
training set. The Coulomb-matrix can easily be extended 
to account for extended or condensed phase systems: Let 
Nj be the number of atoms in the unit cell, and let Ni 
be the number of atoms in unit cell plus sufficiently large 
surrounding environment, then define Mjj as above ex- 
cept that all off-diagonal elements are set to zero for all 
I and J larger than Nj. 

We can measure the distance between two molecules by 
the Euclidean norm of their diagonalizcd Coulomb ma- 
trices: d(M A ,M B ) = d(e A ,e B ) = ^J2i \ e ieA - e/es| 2 , 
where e are the eigenvalues of M in order of decreas- 
ing absolute value. The physical meaning of represent- 
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ing CCS in this way can easily be understood by con- 
sidering the simplest of all molecules, homo-nuclear di- 
atomics (i.e. Z = Z\ = Z2 and r = |Ri — R^D- 
Any corresponding M is then simply defined by its two 
eigenvalues, the roots of its characteristic polynomial, 
e i/2 = 0.5Z 2A ± Z 2 /r. When measuring similarity be- 
tween two such diatomics with different interatomic dis- 
tances, ta and TBi the measure of similarity reduces to 
d(M A ,M B ) = V2Z 2 (r B - r A )/(r A r B ); and the corre- 
sponding estimated potential energy curve for any new 
interatomic distance, r A , as trained on N other inter- 
atomic distances, {r^}, is given by 



E est (r A ) 



N 

E 

i=l 



ai exp 



ta) 



(20) 




FIG. 7: Sketch of two homometric molecules (same stoi- 
chiometry, same sum of interatomic distances) from Refji^. 
The Coulomb-matrix (sorted or a set of its permutants) can 
distinguish these two molecules^^-^. 



D. Alternative descriptors for CCS 



In complete analogy, a ML model of the homo-nuclear 
dimer can also analytically be understood in terms of 
other homo-nuclear dimers with differing atomic num- 
bers, hetero-nuclear dimers, or hetero-nuclear trimers. 
The ease of differentiation with respect to not only ge- 
ometry (d r E est ) but also with respect to composition 
(dzE est ) illustrates further advantages of such a simple 
model. 

The Coulomb matrix uniquely encodes any compound 
because stoichiomctry as well as atomic configuration are 
explicitly accounted for. Even homometric molecules^l, 
see Fig. [3 are uniquely encoded by M. Symmetrically 
equivalent atoms will contribute equally, and the repre- 
sentation is rotationally and translationally invariant. In 
order to gain invariance of M with respect to the in- 
dex ordering of atoms one can either diagonalizc, sort 
rows and columns according to their norm, or use sets 
of matrices with permutated rows and columns. Using 
the eigenvalues of M will yield an undercomplete rep- 
resentation. As with any coarsened representation, the 
Ni degrees of freedom represented by eigenvalues will 
fail to uniquely represent the full set of 3N] — 6 degrees 
of freedom for any non-linear molecule with more than 
three atom o 131 ' 132 . While sorting by the norm of rows 
(or columns) leads to an overcomplete, index invariant, 
and unique representation, the matrix is no longer dif- 
ferentiable for any combination of matrix entries that 
could be achieved through changes in geometry or in 
nuclear charges. Extending the representation by ran- 
domly permutated variants of Coulomb matrices is fea- 
sible, and leads to dramatic improvement in predictive 



We shall now discuss more sophisticated alternatives 
to the Coulomb-matrix. An intuitive extension is to as- 
sume a matrix with an interatomic potential form. This 
could be worth-while as long as the incurred computa- 
tional overhead is small by comparison to the method 
used to generate the reference data. For example, 







%r-'®)') (21) 



would correspond to the Lennard- Jones analog to the 
Coulomb-matrix. Similarly, a Morse or Buckingham ma- 
trix could be constructed. One could even conceive to 
go beyond such pair-wise approaches and introduce in- 
teratomic 3 and higher order terms in the form of molec- 
ular tensors. But also electronic structure models can 
be encoded in terms of such a representation, such as ex- 
tended Hiickcl theory, semi-empirical quantum chemistry 
or tight-binding models. For example, an orbital free 
Thomas-Fermi DFT representation^ is possible when 
based on a data-base of frozen free atomic electron den- 
sities, {m(r)}. The "Hartree" matrix is given by, 







V I = J. 



fdrdr' " j(r)nj(r,) V 1 + J, 



the "external" potential matrix is given by, 



M e jf 



Jdr 



ni(r)Zj 



V 1 = J, 
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To encode known invariances through and the "kinetic" matrix is given within 



accuracy 

such data extension has also been successful for improv- 
ing the accuracy of handwritten digit recognition-^. Due 
to disadvantageous scaling, this approach might prove 
problematic, however, when it comes to larger systems. 
As discussed in Ref J^i, these are all crucial criteria for 
representing atomistic systems within statistical models. 



r C F Jdr m(r) 



5/3 



V I = J, 

V 1 + J, 



(22) 



(23) 



(24) 



where Cf is a constant, and atomic integrals are eval- 
uated over all of space. If need be, the kinetic matrix 
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could even be extended by the von Weiszacker correc- 
tion term, | f dr |Vn/(r)| 2 /n/(r)^. Summation of all 
entries in the matrix and addition of the off-diagonal 
Coulomb-matrix entries would yield the corresponding 
exact DFT energy for frozen atomic electron densities. 
Unfortunately, preliminary training on atomization en- 
ergies of the GDB-7 data seli^ indicates that neither use 
of the Lennard- Jones nor of the Thomas Fermi matrices 
leads to any significant improvement in predictive ac- 
curacy when compared to the original Coulomb-matrix 
representation in RefJ^i. A possible explanation for this 
non-intuitive result is that these more sophisticated de- 
scriptors are no longer monotonic functions in geometries 
and stoichiometries — in contrast to the Coulomb matrix. 

An alternative new descriptor, entirely consistent with 
the first principles view on CCS, has recently been pro- 
posed^. Each atom / in the molecule is represented 
by its nuclear charge multiplied with a cosine term that 
contains a radial distribution function of atom / with 
respect to all other atoms J. Summing up the atomic 
contributions yields a Fourier series of radial distribution 
functions which, because of the superposition principle, 
is not only unique for each compound, but also invari- 
ant with respect to molecular rotations, translations, and 
atom indexing. 

Nj 1 Ni 

M(d) = j2 z J co ^—Y. Zie ~ id ~ dIJ}2/ ' 7 ^ ( 25 ) 

where du = |Rj — Rj|, and n and a are hyper parame- 
ters that can be optimized. This descriptor has units of 
charge™, d has units of distance and goes from zero be- 
yond the largest interatomic distance. As in the case of 
the Coulomb-matrix described above, the environment 
of large or condensed systems can be accounted for by 
chosing Nj to be larger than Nj. The reader is referred 
to the original paper for further detail*^! 

VI. CONCLUDING REMARKS 

We have reviewed a notion of chemical compound 
space that is consistent with any ab initio approach to 



atomistic simulations. Starting from an energy hierarchy, 
variations in nuclear charge distributions have been dis- 
cussed, followed by order-parameter based interpolation 
approaches, and statistical learning methods. The con- 
cepts presented offer a seamless and rigorous framework 
to unify electronic structure theory with rigorous rational 
as well as combinatorial compound design efforts. This 
view of chemical space is advantageous for several rea- 
sons, (i) equipped with such a notion, important funda- 
mental questions can be tackled in the future, including 
rigorous definitions of diversity in CCS, property trans- 
ferability, uncertainty, and selection bias in training sets; 
(ii) transferability and applicability typical for the black- 
box characteristics and the accuracy of ab initio calcula- 
tions can be achieved; (iii) a mathematically, physically, 
and chemically rigorous notion of relevant input variables 
enables the application of sophisticated property opti- 
mization algorithms. Ultimately, efforts along these lines 
promise to lead to "the right compound for the right 
reason" , promising to replace by systematic engineering 
protocols the heuristics and serendipity on which most, 
if not all, of the past compound discoveries have relied. 
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